[CORRUPTED] Synthetic Benchmark PR #138126 - Fix stats performance #30

tomerqodo · 2025-12-04T20:37:32Z

User description

Benchmark PR elastic#138126

Type: Corrupted (contains bugs)

Original PR Title: Fix stats performance
Original PR Description: This fixes the N^2 performance problem described in elastic#97222. In addition to restoring the previous partial fix (elastic#130857), it does the following:

IndicesQueryCache::getStats now accepts a Supplier, so that IndicesQueryCache::getSharedRamSizeForAllShards is called only when it is actually needed. This fixes an N^2 performance problem introduced by elastic/elasticsearch#130857 ("Improving statsByShard performance when the number of shards is very large"). Before elastic/elasticsearch#130857, a call to TransportIndicesStatsAction that did not request query cache stats never entered the N^2 loop (it was entered only when query cache stats were requested). After elastic/elasticsearch#130857, every call paid the N^2 cost. This is a serious problem for clusters with a large number of shards, since this action is called very frequently (including every 30 seconds by a background task).
It fixes the N^2 performance in TransportIndicesStatsAction by sharing state across all shardOperation calls on a single node, using the new NodeContext feature from elastic/elasticsearch#138057 ("Adding NodeContext to TransportBroadcastByNodeAction").
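
The Supplier-based change described above can be illustrated with a minimal sketch. It is not the PR's code: `LazySharedRamSketch`, `collectStats`, and `expensiveSharedRam` are hypothetical names standing in for the real stats classes. The only point shown is that the expensive computation runs when query cache stats are requested and is skipped otherwise.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazySharedRamSketch {
    /** Counts how often the expensive computation actually runs. */
    static final AtomicInteger expensiveCalls = new AtomicInteger();

    /** Hypothetical stand-in for a walk over every shard on the node. */
    static long expensiveSharedRam() {
        expensiveCalls.incrementAndGet();
        return 1024L;
    }

    /**
     * Hypothetical stats collector. The supplier is consulted only when
     * the caller asked for query cache stats, so other requests never
     * trigger the expensive walk.
     */
    static long collectStats(boolean queryCacheRequested, Supplier<Long> sharedRam) {
        long queryCacheBytes = 0L;
        if (queryCacheRequested) {
            queryCacheBytes += sharedRam.get();
        }
        return queryCacheBytes;
    }

    public static void main(String[] args) {
        collectStats(false, LazySharedRamSketch::expensiveSharedRam);
        System.out.println("calls without query cache flag: " + expensiveCalls.get());
        collectStats(true, LazySharedRamSketch::expensiveSharedRam);
        System.out.println("calls with query cache flag: " + expensiveCalls.get());
    }
}
```

Passing a value instead of a supplier would force the caller to compute it up front, which is the cost the description says was being paid on every call.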

Closes elastic#97222
Original PR URL: elastic#138126

PR Type

Bug fix, Enhancement

Description

Fixes N^2 performance problem in stats APIs by using Supplier pattern for query cache stats computation
Introduces CacheTotals record and refactors shared RAM calculation to avoid redundant iterations
Implements NodeContext in TransportBroadcastByNodeAction to share cache state across shard operations
Updates CommonStats.getShardLevelStats to accept precomputed shared RAM supplier parameter
Refactors IndicesQueryCache with new static methods for computing cache totals and shared RAM per shard

Diagram Walkthrough

flowchart LR
  A["Stats API Requests"] -->|"uses Supplier pattern"| B["CommonStats.getShardLevelStats"]
  B -->|"precomputed shared RAM"| C["IndicesQueryCache.getStats"]
  D["TransportIndicesStatsAction"] -->|"creates NodeContext"| E["CachedSupplier"]
  E -->|"computes once per node"| F["getCacheTotalsForAllShards"]
  F -->|"distributes shared RAM"| G["getSharedRamSizeForShard"]
  G -->|"avoids N^2 loop"| H["Performance Improvement"]

File Walkthrough

Relevant files

Enhancement

5 files

IndicesQueryCache.java `Refactor cache stats computation with new static methods`	+82/-34
CommonStats.java `Add precomputed shared RAM supplier parameter to getShardLevelStats`	+8/-2
TransportIndicesStatsAction.java `Implement NodeContext for sharing cache state across shards`	+20/-3
TransportClusterStatsAction.java `Use CachedSupplier to defer query cache computation`	+16/-3
IndicesService.java `Precompute shared RAM map before iterating shards`	+16/-8

Tests

5 files

IndicesQueryCacheTests.java `Update tests to pass precomputed shared RAM supplier`	+129/-82
VersionStatsTests.java `Update test to pass shared RAM supplier parameter`	+1/-1
IndexShardTests.java `Update test to pass shared RAM supplier parameter`	+1/-1
IndicesServiceCloseTests.java `Update test to pass shared RAM supplier parameter`	+7/-7
IndicesServiceTests.java `Update mocks to handle new precomputed shared RAM parameter`	+10/-4

Documentation

1 file

138126.yaml `Add changelog entry for stats performance fix`	+6/-0

… is very large (elastic#130857)" (elastic#137973) (elastic#137984) This reverts commit 391de08.

… to avoid the N^2 performance in TransportIndicesStatsAction if the user did not ask for query cache stats (although it is still there if the user asks for query cache stats). It also avoids O(N) performance in TransportClusterStatsAction if the user did not ask for query cache stats.

…performance when a user asks for query cache stats

Co-authored-by: Copilot <[email protected]>

qodo-code-review · 2025-12-04T20:38:03Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢	No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code. Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting. Status: Passed. Learn more about managing compliance generic rules or creating your own custom rules.
	Generic: Secure Error Handling. Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Passed.
	Generic: Secure Logging Practices. Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Passed.
	Generic: Security-First Input Validation and Data Handling. Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities. Status: Passed.
⚪	Generic: Comprehensive Audit Trails. Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Audit Logging: The new cache computation and stats retrieval logic performs critical actions (e.g., computing and distributing shared RAM usage) without evident audit logging of access or changes, but visibility into the broader logging framework is limited in this diff. Referred Code:

	    public static Map<ShardId, Long> getSharedRamSizeForAllShards(IndicesService indicesService) {
	        Map<ShardId, Long> shardIdToSharedRam = new HashMap<>();
	        IndicesQueryCache.CacheTotals cacheTotals = IndicesQueryCache.getCacheTotalsForAllShards(indicesService);
	        for (IndexService indexService : indicesService) {
	            for (IndexShard indexShard : indexService) {
	                final var queryCache = indicesService.getIndicesQueryCache();
	                long sharedRam = (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), cacheTotals);
	                // as a size optimization, only store non-zero values in the map
	                if (sharedRam > 0L) {
	                    shardIdToSharedRam.put(indexShard.shardId(), sharedRam);
	                }
	            }
	        }
	        return Collections.unmodifiableMap(shardIdToSharedRam);
	    }

	    public long getCacheSizeForShard(ShardId shardId) {
	        Stats stats = shardStats.get(shardId);
	        return stats != null ? stats.cacheSize : 0L;
	    }
	    ... (clipped 4 lines)
	Generic: Robust Error Handling and Edge Case Management. Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation. Status: Edge Cases: The new Supplier-based shared RAM computation may return null or zero values for shards, and while defaults are used in places, comprehensive handling of failures from supplier evaluation (e.g., exceptions or empty maps) is not evident from the diff. Referred Code:

	    @Override
	    protected void shardOperation(
	        IndicesStatsRequest request,
	        ShardRouting shardRouting,
	        Task task,
	        Supplier<IndicesQueryCache.CacheTotals> context,
	        ActionListener<ShardStats> listener
	    ) {
	        ActionListener.completeWith(listener, () -> {
	            assert task instanceof CancellableTask;
	            IndexService indexService = indicesService.indexServiceSafe(shardRouting.shardId().getIndex());
	            IndexShard indexShard = indexService.getShard(shardRouting.shardId().id());
	            CommonStats commonStats = CommonStats.getShardLevelStats(
	                indicesService.getIndicesQueryCache(),
	                indexShard,
	                request.flags(),
	                () -> {
	                    final IndicesQueryCache queryCache = indicesService.getIndicesQueryCache();
	                    IndicesQueryCache.CacheTotals freshTotals = IndicesQueryCache.getCacheTotalsForAllShards(indicesService);
	                    return (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), freshTotals);
	                }
	    ... (clipped 2 lines)
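
The referred `shardOperation` code receives a `context` supplier but recomputes the totals inside its lambda. A minimal sketch of why that matters follows. It assumes a hypothetical `memoize` helper and a simplified `CacheTotals` class (neither is the Elasticsearch implementation) and only counts how many times the totals are computed across N shard operations.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SharedContextSketch {
    /** Counts how many times the node-wide totals are computed. */
    static final AtomicInteger totalsComputations = new AtomicInteger();

    /** Simplified, hypothetical stand-in for the PR's CacheTotals. */
    static final class CacheTotals {
        final long totalItemsInCache;
        final int shardCount;

        CacheTotals(long totalItemsInCache, int shardCount) {
            this.totalItemsInCache = totalItemsInCache;
            this.shardCount = shardCount;
        }
    }

    /** Stand-in for a walk over all shards; each call costs O(N). */
    static CacheTotals computeTotals(int shardCount) {
        totalsComputations.incrementAndGet();
        return new CacheTotals(0L, shardCount);
    }

    /** Hypothetical memoizing wrapper: the delegate runs at most once. */
    static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = delegate.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    /** Simulates one shard operation per shard and reports how often totals were computed. */
    static int runShardOperations(int shardCount, Supplier<CacheTotals> totals) {
        totalsComputations.set(0);
        for (int shard = 0; shard < shardCount; shard++) {
            totals.get();
        }
        return totalsComputations.get();
    }

    public static void main(String[] args) {
        int shards = 100;
        System.out.println(runShardOperations(shards, () -> computeTotals(shards)));
        System.out.println(runShardOperations(shards, memoize(() -> computeTotals(shards))));
    }
}
```

With a fresh computation per shard, N shard operations each trigger an O(N) walk, which is the N^2 behavior. With one shared, memoized supplier per node, the walk happens once.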

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2025-12-04T20:39:08Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Fix division-by-zero error in calculation
Fix a division-by-zero error in the shared RAM calculation. When `totalItemsInCache` is zero, the code should divide by `shardCount` instead to correctly apportion memory usage.
server/src/main/java/org/elasticsearch/indices/IndicesQueryCache.java [177-180]

	     if (totalItemsInCache == 0) {
	         // all shards have zero cache footprint, so we apportion the size of the shared bytes equally across all shards
	    -    additionalRamBytesUsed = Math.round((double) sharedRamBytesUsed / totalItemsInCache);
	    +    additionalRamBytesUsed = Math.round((double) sharedRamBytesUsed / shardCount);
	     } else {

Suggestion importance[1-10]: 9
Why: The suggestion correctly identifies a division-by-zero bug introduced in the PR, where `totalItemsInCache` is used as a divisor within a conditional block that only executes when `totalItemsInCache` is zero.
Impact: High
Possible issue	Use cached context to improve performance
Improve performance by using the cached `context` in `shardOperation`. The current implementation re-calculates `CacheTotals` for each shard, negating the performance benefit of the newly introduced caching mechanism.
server/src/main/java/org/elasticsearch/action/admin/indices/stats/TransportIndicesStatsAction.java [135-144]

	     CommonStats commonStats = CommonStats.getShardLevelStats(
	         indicesService.getIndicesQueryCache(),
	         indexShard,
	         request.flags(),
	         () -> {
	             final IndicesQueryCache queryCache = indicesService.getIndicesQueryCache();
	    -        IndicesQueryCache.CacheTotals freshTotals = IndicesQueryCache.getCacheTotalsForAllShards(indicesService);
	    -        return (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), freshTotals);
	    +        IndicesQueryCache.CacheTotals totals = context.get();
	    +        return (queryCache == null) ? 0L : queryCache.getSharedRamSizeForShard(indexShard.shardId(), totals);
	         }
	     );

Suggestion importance[1-10]: 8
Why: The suggestion correctly points out that the implementation fails to use the pre-computed context, instead re-calculating expensive totals for every shard, which defeats the purpose of the performance optimization introduced in the PR.
Impact: Medium
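
The apportioning rule behind the first suggestion in the table above can be sketched as a standalone function. This is an illustration under assumptions, not the PR's method: the class and method names are hypothetical, and the proportional branch (shared bytes scaled by the shard's fraction of cached items) is assumed rather than quoted from the diff. One detail worth noting: because the dividend is cast to `double`, dividing by a zero `long` in Java does not throw. It yields infinity, and `Math.round` then returns `Long.MAX_VALUE`, so the symptom would be a wildly inflated size rather than an exception.

```java
public class SharedRamApportionSketch {
    /**
     * Hypothetical apportioning of shared cache RAM to one shard.
     * If nothing is cached anywhere, the shared bytes are split equally
     * across shards; otherwise (assumed) they are split in proportion
     * to the shard's share of cached items.
     */
    static long sharedRamForShard(
        long sharedRamBytesUsed,
        int shardCount,
        long totalItemsInCache,
        long itemsInCacheForShard
    ) {
        if (sharedRamBytesUsed == 0L || shardCount == 0) {
            return 0L;
        }
        if (totalItemsInCache == 0L) {
            // Equal split: divide by the number of shards, not by the (zero) item count.
            return Math.round((double) sharedRamBytesUsed / shardCount);
        }
        return Math.round((double) sharedRamBytesUsed * itemsInCacheForShard / totalItemsInCache);
    }

    /** Reproduces the faulty expression from the diff for comparison. */
    static long faultyEqualSplit(long sharedRamBytesUsed, long totalItemsInCache) {
        return Math.round((double) sharedRamBytesUsed / totalItemsInCache);
    }

    public static void main(String[] args) {
        System.out.println(sharedRamForShard(1000L, 4, 0L, 0L));
        System.out.println(sharedRamForShard(1000L, 4, 10L, 3L));
        System.out.println(faultyEqualSplit(1000L, 0L) == Long.MAX_VALUE);
    }
}
```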

masseyke and others added 11 commits November 14, 2025 17:10

Reapply "Improving statsByShard performance when the number of shards…

a3cf3f2

… is very large (elastic#130857)" (elastic#137973) (elastic#137984) This reverts commit 391de08.

Using the NodeContext in TransportBroadcastByNodeAction to avoid N^2 …

f516df1

…performance when a user asks for query cache stats

Update docs/changelog/138126.yaml

6ade46d

Removing unused method and fixing tests

c1903f9

Merge branch 'main' into fix-stats-performance

3ec4474

Apply suggestions from code review

ec707e5

Co-authored-by: Copilot <[email protected]>

Merge branch 'main' into fix-stats-performance

98b763d

Cleaning up changelog

df90b96

Merge branch 'main' into fix-stats-performance

fc029a5

Apply changes for benchmark PR

48578b5

qodo-code-review bot added the Review effort 3/5 label Dec 4, 2025

3 participants
